A method of combinatorial cassette mutagenesis was
designed to readily determine the informational content
of individual residues in protein sequences. The technique
consists of simultaneously randomizing two or three
positions by oligonucleotide cassette mutagenesis, selecting
for functional protein, and then sequencing to determine
the spectrum of allowable substitutions at each
position. Repeated application of this method to the
dimer interface of the DNA-binding domain of λ repressor
reveals that the number and type of substitutions
allowed at each position are extremely variable. At some
positions only one or two residues are functionally acceptable;
at other positions a wide range of residues and
residue types are tolerated. The number of substitutions
allowed at each position roughly correlates with the
solvent accessibility of the wild-type side chain.



IT HAS BEEN MORE THAN 20 YEARS SINCE ANFINSEN AND HIS
colleagues showed that the sequence of a protein contains all of
the information necessary to specify the three-dimensional
structure (2). However, the general problem of predicting protein
structure from sequence remains unsolved. Part of the difficulty may
stem from the complexity of protein structures. Although some 200
protein structures are known, no rules have emerged that allow
structure to be related to sequence in any simple fashion (2). The
problem is further complicated by the nonuniformity of the
structural information encoded in protein sequences. Some residue
positions are important, and changes at these positions can tip the
balance between folding and unfolding (3-7). Other residues are
relatively unimportant in a structural sense and a wide range of
substitutions or modifications can be tolerated at these positions
(3, 7-9).

If only a fraction of the residues in a protein sequence contribute
significantly to the stability of the folded structure, then it becomes
important to be able to identify these residues. We now describe the
results of genetic studies that allow the importance of individual
residues in protein sequences to be rapidly determined. Specifically,
we determine the spectrum of functionally acceptable substitutions
at residue positions near the dimer interface of the NH2-terminal
domain of phage lambda (λ) repressor (10). The NH2-terminal
domain binds to operator DNA as a dimer, with dimerization
mediated by hydrophobic packing of α-helix 5 of one monomer
against α-helix 5' of the other monomer (11) (Fig. 1, A and B).
Without helix 5 there are no contacts between the subunits (Fig.
1C). By applying combinatorial cassette mutagenesis to the helix 5
region, we find that the number and spectrum of allowable
substitutions within helix 5 are extremely variable from residue to
residue. In most cases, this variability can be rationalized in terms
of the fractional solvent accessibility of the wild-type side chain.

General strategy. For our studies, we used a plasmid-borne gene
that encodes a functional, operator-binding fragment (residues
1-102) of λ repressor (12). The binding of the 1-102 fragment to
operator DNA depends on dimerization which, in turn, depends on
the helix 5-helix 5' packing interactions (13). Thus, if a 1-102
protein retains normal operator-binding properties, we can infer
that it is able to dimerize normally.

Mutagenesis of the helix 5 region was performed by a combinatorial
cassette procedure. One example of this method, in which
codons 85 and 88 are mutagenized, is illustrated in Fig. 2. On the
top strand, the mutagenized codons are synthesized with equal
mixtures of all four bases in the first two codon positions and an
equal mixture of G and C in the third position. The resulting
population of base combinations will include codons for each of the
20 naturally occurring amino acids at each of the mutagenized
residue positions. On the bottom strand, inosine is inserted at each
randomized position because it is able to pair with each of the four
conventional bases (14). The two strands are then annealed and the
mutagenic cassette is ligated into a purified plasmid backbone.
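
The claim that this randomization scheme covers all 20 amino acids can be checked by enumeration. The following is a minimal sketch, assuming the standard genetic code; the function and variable names are illustrative and do not come from the original study.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
# (third base varies fastest over T, C, A, G); "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def randomized_codons(third_position="GC"):
    """All codons with any base at positions 1 and 2 and G or C at position 3."""
    return ["".join(c) for c in product(BASES, BASES, third_position)]


def summarize(codons):
    """Count how many of the given codons encode each amino acid (or stop)."""
    counts = {}
    for codon in codons:
        aa = CODON_TABLE[codon]
        counts[aa] = counts.get(aa, 0) + 1
    return counts


codons = randomized_codons()
counts = summarize(codons)
amino_acids = sorted(aa for aa in counts if aa != "*")
print(len(codons), "codons,", len(amino_acids), "amino acids,",
      counts.get("*", 0), "stop codon(s)")
```

Restricting the third position to G or C reduces the pool from 64 to 32 codons while keeping every amino acid, and leaves TAG as the only stop codon in the pool.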

To identify plasmids encoding functional protein, we selected
transformants for plasmid-encoded resistance to ampicillin and for
resistance to killing by cI- derivatives of phage λ. The latter selection
requires that the cell express 1-102 protein that is active in operator
binding (15). For each mutagenesis experiment, many independent
transformants were chosen, single-stranded plasmid DNA was
purified, and the relevant region of the 1-102 gene was sequenced.
The resulting set of sequences provides a list of functionally
acceptable helix 5 residues.
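
The step from a set of sequenced active genes to a per-position list of acceptable residues can be sketched as a simple tally. This is only an illustration: the wild-type residues for positions 84 to 91 are those named in the text, but the mutant sequences below are hypothetical placeholders, not data from the study, and the function name is an assumption.

```python
WILD_TYPE = "IYEMYEAV"   # residues 84-91 as named in the text
FIRST_POSITION = 84


def acceptable_residues(sequences, first_position=FIRST_POSITION):
    """Map each residue position to the set of residues seen in active sequences."""
    allowed = {}
    for seq in sequences:
        for offset, residue in enumerate(seq):
            allowed.setdefault(first_position + offset, set()).add(residue)
    return allowed


# HYPOTHETICAL active sequences from a cassette randomizing positions 85 and 88.
hypothetical_active = [WILD_TYPE, "IAEMFEAV", "ISEMYEAV", "ILEMCEAV"]

spectrum = acceptable_residues(hypothetical_active)
for position in sorted(spectrum):
    print(position, "".join(sorted(spectrum[position])))
```

Positions that were not randomized in a given cassette report only the wild-type residue, so results from several cassettes would need to be merged to cover the whole helix.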

Substitutions in the helix 5 region. In separate experiments with
different mutagenic cassettes, the codons for helix 5 residues 85 and
88; 86 and 89; 90 and 91; 84, 87, and 88; and 84, 87, and 91 were
mutagenized, and genes encoding active 1-102 proteins were
selected. In some cases, the survival frequency was low. For example,
only 17 of 60,000 transformants passed the selection after
randomization of codons 84, 87, and 88. In this case, each active candidate
was sequenced. By contrast, 1,200 of 50,000 transformants passed
the selection in the mutagenesis of positions 86 and 89 (16). In this
case, we picked 50 candidates for sequence analysis. Overall, 150
active genes were sequenced (Table 1). In addition, we sequenced
a number of unselected candidates; these [...] of mutagenesis and
also provide examples of helix 5 mutations that result in inactive
1-102 proteins (Table 1).

Many of the active sequences contain at least two residue changes
compared to wild type. In principle, some of these changes could be
compensatory; for example, residue X might be functionally allowed
at position 85 only in combination with residue Z at position 88.
This cannot be generally true, however, because most residue
changes at one position were recovered in combination with several
different changes at the other position or positions. It is therefore
likely that most substitutions that are functionally acceptable in
multiply mutant backgrounds would also be allowed as single
substitutions. In Fig. 3, we show the spectrum of functionally
acceptable substitutions at residue positions 84 to 91.

From the list of allowed substitutions, several conclusions may be
drawn concerning the structural requirements at various positions in
helix 5. We now consider these residue positions in order of
decreasing "informational content," where this term is roughly
defined as a value that decreases as the number of allowed
substitutions increases. Thus, the informational content of a residue
position is highest if only the wild-type amino acid is allowed and is
lowest if each of the 20 naturally occurring amino acids is allowed.

Positions 84 and 87 in particular stand out as having a high
informational content. Ile appears to be the only acceptable residue
at position 84. Both Met and Leu are residues of similar size and
hydrophobicity, and are the only two residues that appear to be
functional at position 87. The side chains of Ile84 and Met87 form a
major part of the helix-helix packing interaction at the dimer
interface, where Ile84 of one subunit packs against Met87' of the
other subunit, and vice versa (Fig. 4). This cluster of four residues
also contacts the globular portions of the domain. Solvent
accessibility calculations by the method of Lee and Richards (17)
show that the Ile84 and Met87 side chains are almost completely
buried (at least 92 percent solvent inaccessible) in the structure of
the dimer. We assume that replacement of Ile84 or Met87 with
smaller side chains would diminish dimerization because hydrophobic
and van der Waals interactions would be lost. In fact, mutant
repressors containing Ser or Thr in place of these residues are
defective in dimerization (O, 18). Replacing Ile84 or Met87 with
larger residues would also be expected to be detrimental because
substantial structural rearrangements would be required to
accommodate larger side chains.

Seven residues (Leu, Ile, Val, Thr, Cys, Ser, and Ala) are
functionally acceptable at position 91. Aromatic residues, charged
residues, and strongly hydrophilic residues are not found. The
wild-type Val side chain is partially buried in the dimer structure,
with the Cγ2 methyl group packing against the Cδ1 methyl group of
the Ile84 side chain. Although some of the acceptable substitutions
such as Ile and Thr could make equivalent packing contacts, others
could not.

Nine residues (Trp, His, Met, Gln, Leu, Val, Ser, Gly, and Ala)
are acceptable at position 90. There is a surprisingly large range in
both the acceptable size and hydrophilicity of these side chains. This
is especially true as the Cβ methyl group of the wild-type Ala is
almost completely buried in the structure of the dimer and, at first
glance, it would appear that larger side chains could not be
accommodated. However, the inaccessibility of the methyl group is
largely caused by a Lys side chain of the other subunit, which packs
against it. By rotating this Lys side chain away, we were able to
introduce a Trp90 side chain by model-building without steric
clashes. Rotation of the Lys side chain away from Ala90 should
not be energetically costly and, in fact, is observed in crystals of the
NH2-terminal domain bound to operator DNA (19).

Nine different residues (Trp, Tyr, Phe, Met, Ile, Val, Cys, Ser, and
Ala) are functionally acceptable at position 88. There are large
variations in the sizes and volumes of the acceptable side chains,
although most are relatively hydrophobic. Charged residues and
other strongly hydrophilic residues are not observed. In the
wild-type dimer (11), the aromatic ring of Tyr88 stacks against the
ring of Tyr88'. The side chains of Trp, Phe, Met, Ile, and Val could
probably form some type of packing interaction at this position,
although those of Ala and Ser could not. It is known that the
presence of Cys at position 88 allows a stable Cys88-Cys88' disulfide
bond, which links the monomers in a conformation that is active in
operator binding (20).

Positions 85, 86, and 89 show considerable variability. At each of
these positions, 13 different amino acids were found to function. At
positions 85 and 86, aromatic, hydrophobic, polar, and charged
residues are all acceptable. At position 89, aromatic residues were
not represented, but each of the remaining classes was observed. In
the wild-type dimer, the side chains of Tyr85, Glu86, and Glu89 are
relatively solvent accessible.

Table 1. Sequences for the helix 5 region of active and inactive mutants
obtained by combinatorial cassette mutagenesis. Active mutants are resistant
to phage λKH54; these are grouped by cassette, with the wild-type sequence
at the top of each group and randomized positions in boldface. Asterisks
indicate sequences of mutants obtained in the absence of a functional
selection. The activity of these mutants was subsequently determined by a
screen. Numbers next to sequences indicate the number of times particular
mutant sequences were obtained. Numbers at the tops of the columns
indicate amino acid positions. The one-letter abbreviations for the amino
acids are: A, Ala; C, Cys; D, Asp; E, Glu; F, Phe; G, Gly; H, His; I, Ile;
K, Lys; L, Leu; M, Met; N, Asn; P, Pro; Q, Gln; R, Arg; S, Ser; T, Thr; V,
Val; W, Trp; and Y, Tyr.

[Table 1 body: sequence listings not recoverable from the source scan.]

Fig. 1. Three views of the DNA-binding domain of λ repressor, showing the
role helix 5 plays in mediating dimerization (26). (A) Complex of the
repressor dimer with operator DNA (H). Helix 5 of each monomer is colored
more lightly than the globular portion of that monomer. (B) Free repressor
dimer, rotated 90° from the view in (A), to show the "back side" of the
molecule. (C) Dimer with helix 5 of each monomer removed.

Fig. 2. [...] inosine was used. After synthesis, the oligonucleotides were
phosphorylated, annealed, and ligated into the Xho I-Sph I backbone of
plasmid pJO103. Plasmid pJO103 is an M13 origin plasmid with the 1-102 gene
under control of a tac promoter; the region of the 1-102 gene encoding
residues 82-93 (the small Xho I-Sph I fragment) is replaced by an unrelated
1.9-kb Xho I-Sph I "stuffer" fragment. Ligated DNA was transformed into
Escherichia coli strain X90 F'lacIq cells (27), and ampicillin-resistant
colonies [...] of phage λ to confirm their [...] properties [...] the
dideoxy method (2?).
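
The counts of acceptable residues reported above for positions 84 to 91 can be tabulated and ranked. The sketch below uses those counts as given in the text. The text defines informational content only qualitatively, so the bits measure log2(20/n) is an assumption added here for illustration, not the authors' definition.

```python
import math

# Number of functionally acceptable residues reported in the text per position.
ACCEPTABLE_COUNTS = {84: 1, 85: 13, 86: 13, 87: 2, 88: 9, 89: 13, 90: 9, 91: 7}
TOTAL_AMINO_ACIDS = 20


def percent_acceptable(count, total=TOTAL_AMINO_ACIDS):
    """Percentage of the 20 amino acids that are acceptable at a position."""
    return 100.0 * count / total


def information_bits(count, total=TOTAL_AMINO_ACIDS):
    """ASSUMED measure: log2(total / count); falls as more residues are allowed."""
    return math.log2(total / count)


# Rank positions from highest to lowest informational content.
ranked = sorted(ACCEPTABLE_COUNTS, key=lambda pos: (ACCEPTABLE_COUNTS[pos], pos))
for pos in ranked:
    n = ACCEPTABLE_COUNTS[pos]
    print(f"position {pos}: {n:2d} residues, "
          f"{percent_acceptable(n):5.1f}% of 20, {information_bits(n):.2f} bits")
```

Any monotonically decreasing function of the count would give the same ranking; the logarithm is used only because it is zero when all 20 residues are allowed.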

Several amino acids are significantly underrepresented among the
active sequences. For example, Pro is never found. This cannot be an
artifact of our mutagenesis procedure because Pro is frequently
observed among the unselected mutant sequences (Table 1). We
conclude that Pro is not found among the functional sequences
because it is selected against; its presence would presumably disrupt
the α-helical structure and thereby the helix-helix packing at the
dimer interface.

His, Asn, and Lys are also underrepresented among the functional
helix 5 sequences. These residues are presumably not acceptable at
positions 84 and 87, where the informational content is extremely
high, and may not be acceptable at positions 88 and 91, where the
functional substitutions are generally hydrophobic in character. The
acceptability of these residues at positions such as 85 and 86 is
difficult to assess from our experiments because the codons for these
residues are present at reasonably low frequencies even among the
unselected sequences. In these cases, we probably have not
sequenced a large enough number of candidates to be confident that
all acceptable substitutions have been identified. In fact, data from
reversion studies (21) and suppressed amber studies (22) show that
particular His and Lys substitutions in this region are acceptable in
the context of the intact λ repressor molecule.

Informational content and protein structure. We have combined
an efficient combinatorial mutagenesis procedure and a
functional selection to probe the informational content of the eight
residues that form the major part of the dimerization interface of the
NH2-terminal, operator-binding domain of λ repressor. At two of
these eight residue positions, the functionally acceptable choices are
highly restricted. For example, we analyzed 17 functional genes in
which codon 84 had been randomized and recovered the wild-type
residue, Ile, in every case. This is clearly a position of high
informational content. The informational content is also high at
position 87, where Met and Leu are the only acceptable residues. By
contrast, the remaining positions have moderate to low
informational contents. For example, among 38 functional genes in which
codon 85 had been randomized, the wild-type residue was recovered
only once, and 12 other residues, differing in size and chemical
properties, were recovered in the remaining cases. This is clearly a
position of low informational content. It is striking that most of the
structural determinants of dimerization in this eight-residue
segment reside in two residues only. The remaining positions are
surprisingly tolerant of a wide range of substitutions. If this high
level of tolerance is generally true of protein sequences, then the
problem of understanding and predicting structure may rest largely
on the ability to identify those few residues that are crucial.

Fig. 3. Functionally acceptable residues in the helix 5 region. The amino
acids are listed from top to bottom in order of increasing hydrophobicity
according to the scale of Eisenberg et al. (30).

Fig. 5. Correlation between the solvent accessibility and the number of
functionally acceptable substitutions. Hatched bars indicate the percentage
of the 20 naturally occurring amino acids that are functionally acceptable at a
residue position. Black bars indicate the fractional solvent accessibility of the
wild-type side chain in the dimer. Solvent accessibilities for the
NH2-terminal domain dimer were computed using a 1.4 Å probe by the
method of Lee and Richards (17). Fractional accessibilities were obtained by
dividing by the appropriate side chain accessibilities calculated for the
monomer. The fractional accessibilities change only slightly if the side chain
accessibilities in the reference tripeptide Ala-X-Ala (H) are used instead as
the reference state.
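
The fractional accessibility plotted in Fig. 5 is a ratio: the side chain's accessible area in the dimer divided by its area in a reference state. A minimal sketch of that normalization follows. The areas used are hypothetical numbers chosen for illustration only; they are not values from the study, and the function name is an assumption.

```python
def fractional_accessibility(dimer_area, reference_area):
    """Side-chain accessible area in the dimer divided by a reference-state area."""
    if reference_area <= 0:
        raise ValueError("reference area must be positive")
    return dimer_area / reference_area


# HYPOTHETICAL areas in square angstroms: (area in dimer, area in reference state).
hypothetical_areas = {
    "buried side chain": (5.0, 100.0),
    "exposed side chain": (90.0, 120.0),
}

fractions = {
    name: fractional_accessibility(dimer, reference)
    for name, (dimer, reference) in hypothetical_areas.items()
}
for name, frac in fractions.items():
    print(f"{name}: {frac:.0%} accessible, {1 - frac:.0%} buried")
```

Swapping the reference state (monomer versus an Ala-X-Ala tripeptide) changes only the denominator, which is why the legend can report that the fractions shift only slightly between the two choices.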

The positional variability of the informational content in helix 5
can, in general, be rationalized in terms of the solvent accessibility of
the wild-type residues in the crystal structure (11). There is a rough
correlation between the number of acceptable substitutions and the
fractional extent to which the wild-type side chain is solvent
accessible (Fig. 5). At exposed surface positions such as 85, 86, and
89, we find that many different residues and residue types can be
functionally accommodated. By contrast, at positions such as 84 and
87, where the wild-type side chain is almost completely buried, we
find that the functionally acceptable residue choices are extremely
restricted. There is one apparent exception to the simple rule that
buried residues are high in informational content. Ala90 is
inaccessible to solvent in the crystal structure, and yet we find that many
substitutions are allowed at this position. However, the
inaccessibility of the Ala90 side chain to solvent is not due to close packing at
the dimer interface, but rather to an interaction with a nearby
surface side chain. This side chain can presumably move to allow
larger side chains to be accommodated at position 90. Examples of
this type demonstrate the need to distinguish between two types of
buried side chains: those that can become exposed by relatively
minor rearrangement of other side chains, and those that are tightly
packed in the hydrophobic core.

There is no reason to assume that there should always be a strict
correlation between the solvent accessibility of a residue and the
structural informational content of that position. For one thing, the
chemical properties of the 20 amino acids are not related in any
simple linear fashion. Moreover, the structural importance of some
residues in proteins almost certainly stems from interactions other
than simple hydrophobic packing. Nevertheless, the closely packed
nature of protein interiors (23) provides a simple molecular
explanation for the structural importance of buried residues, and
destabilizing mutations are commonly found to affect hydrophobic core
residues (3-7). By contrast, missense mutations or chemical
modifications that affect surface residues are often found to have little or no
influence on protein stability (3, 7, 8). Thus, it is reasonable that
solvent accessibility should be an extremely important determinant
of the informational content of a residue position.

Our overall strategy for rapidly probing informational content
should be broadly applicable to a wide range of protein
structure-function problems in systems where genetic selections or screens can
be devised. The method consists of three basic elements: (i) the use
of cassette mutagenesis to introduce extremely high levels of
targeted random mutagenesis; (ii) the use of a functional selection to
identify genes encoding active proteins; and (iii) the use of rapid
DNA sequencing methods to determine the spectrum of
functionally acceptable residues in a relatively large number of candidates. Our
method of combinatorial cassette mutagenesis (Fig. 2) allows several
residue positions to be mutagenized at the same time and, in
principle, generates a mutant population in which each of the 20
amino acids is represented at each mutagenized position (24). When
two or three codons are mutagenized at the same time, the entire
analysis is able to proceed more rapidly. Moreover, at this level of
mutagenesis most two-residue and three-residue combinations
should be present in the mutagenized population and should be
recovered if they result in a functional protein. In our study of the
packing of the 84 and 87 side chains, we recovered only two (Ile84
with Met87 and Ile84 with Leu87) of the 400 possible residue
combinations. Thus, because both positions were mutagenized in
the same experiment, we are able to conclude that there are not
significantly different ways of packing the dimer interface.
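
The figure of 400 possible residue combinations, and the sizes of the corresponding codon pools, follow from simple counting. The sketch below assumes 32 codons per randomized position (any base at the first two positions, G or C at the third) and uses the transformant counts quoted earlier in the text; the function name is illustrative.

```python
CODONS_PER_POSITION = 4 * 4 * 2   # any base, any base, then G or C
AMINO_ACIDS = 20


def library_sizes(n_positions):
    """Return (codon combinations, residue combinations) for n randomized positions."""
    return CODONS_PER_POSITION ** n_positions, AMINO_ACIDS ** n_positions


for n in (2, 3):
    codon_combos, residue_combos = library_sizes(n)
    print(f"{n} positions: {codon_combos} codon combinations, "
          f"{residue_combos} residue combinations")

# Transformant counts quoted in the text: (passed selection, total screened).
survival = {"84, 87, and 88": (17, 60_000), "86 and 89": (1_200, 50_000)}
for cassette, (passed, screened) in survival.items():
    print(f"positions {cassette}: {passed / screened:.3%} of transformants passed")
```

With three randomized positions the codon pool (32,768) is of the same order as the 60,000 transformants screened, so coverage of every combination is plausible but not guaranteed by these numbers alone.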

In principle, data like that shown in Fig. 3 could be generated for
an entire protein sequence, and additional experiments could be
devised to determine whether the positions of high informational
content were important for structure or function. For proteins of
unknown structure, such data might be quite useful for structural
predictions. First, current predictive algorithms could be applied to
the family of related sequences generated by our method, as each of
these sequences is able to form the same basic structure. Second,
because of their fundamental repeats, α-helical and β-strand regions
might be recognized by characteristic patterns of high and low
informational content. Third, the positions of highest structural
informational content should include the residues involved in
formation of the hydrophobic core of the protein. This information
might prove useful in combination with the tertiary template ideas
recently proposed (25).